
    An introduction to crowdsourcing for language and multimedia technology research

    Language and multimedia technology research often relies on large manually constructed datasets for training or evaluation of algorithms and systems. Constructing these datasets is often expensive, with significant challenges in terms of recruitment of personnel to carry out the work. Crowdsourcing methods using scalable pools of workers available on demand offer a flexible means of rapid, low-cost construction of many of these datasets to support existing research requirements and potentially promote new research initiatives that would otherwise not be possible.

    Quality Assessment in Crowdsourced Indigenous Language Transcription

    The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialised notation system. Previous attempts have been made to convert the approximately 20,000 pages of text to a machine-readable form using machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. In this paper, a crowdsourcing method is proposed to transcribe the manuscripts, where non-expert volunteers transcribe pages of the notebooks using an online tool. Experiments were conducted to determine the quality and consistency of transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80% for |Xam text and 95% for English text. When the |Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75%, which exceeds that of previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
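
    A minimal sketch of the kind of agreement/accuracy analysis described above. It assumes transcriptions are plain strings and uses a character-level similarity ratio; the similarity measure, the per-page data layout, and the function names are illustrative assumptions, not the paper's exact method.

```python
# Sketch: inter-transcriber agreement, gold-standard accuracy, and their correlation.
# Assumes character-level similarity as a stand-in for the paper's agreement measure.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import correlation, mean


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two transcriptions (0..1)."""
    return SequenceMatcher(None, a, b).ratio()


def inter_transcriber_agreement(transcriptions: list[str]) -> float:
    """Mean pairwise similarity among all volunteer transcriptions of one page."""
    return mean(similarity(a, b) for a, b in combinations(transcriptions, 2))


def accuracy_vs_gold(transcriptions: list[str], gold: str) -> float:
    """Mean similarity of the volunteer transcriptions to the gold standard."""
    return mean(similarity(t, gold) for t in transcriptions)


def agreement_accuracy_correlation(pages: dict[str, tuple[list[str], str]]) -> float:
    """Pearson correlation of per-page agreement with per-page accuracy.

    `pages` maps a page id to (volunteer transcriptions, gold transcription);
    a positive correlation supports using agreement to assess unseen pages.
    """
    agreements = [inter_transcriber_agreement(ts) for ts, _ in pages.values()]
    accuracies = [accuracy_vs_gold(ts, gold) for ts, gold in pages.values()]
    return correlation(agreements, accuracies)
```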

    Taking MT evaluation metrics to extremes : beyond correlation with human judgments

    Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics, focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than traditional statistical MT systems. Finally, we show that the difference in evaluation accuracy between metrics is maintained even when the gold-standard scores are based on different criteria.
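
    A hedged sketch of the contrast between a single global correlation and correlations computed within quality bands. The banding below is an illustrative stand-in for the paper's local dependency measure, not its actual definition, and the 0-100 human score scale is an assumption.

```python
# Sketch: global metric-human correlation vs. correlations within quality bands,
# illustrating how metric performance can differ across translation quality levels.
from statistics import correlation


def global_correlation(metric_scores: list[float], human_scores: list[float]) -> float:
    """Standard global Pearson correlation over the whole test set."""
    return correlation(metric_scores, human_scores)


def banded_correlations(metric_scores: list[float],
                        human_scores: list[float],
                        bands=((0, 40), (40, 70), (70, 101))):
    """Correlation restricted to segments whose human score falls in each band,
    e.g. low-, mid- and high-quality translations (assumed 0-100 human scale)."""
    results = {}
    for lo, hi in bands:
        idx = [i for i, h in enumerate(human_scores) if lo <= h < hi]
        if len(idx) >= 2:  # correlation needs at least two points per band
            results[(lo, hi)] = correlation(
                [metric_scores[i] for i in idx],
                [human_scores[i] for i in idx],
            )
    return results
```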

    Findings of the 2009 Workshop on Statistical Machine Translation

    This paper presents the results of the WMT09 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 87 machine translation systems and 22 system combination entries. We used the rankings of these systems to measure how strongly more than 20 automatic metrics correlate with human judgments of translation quality. We present a new evaluation technique whereby system output is edited and judged for correctness.
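
    A minimal sketch of system-level metric meta-evaluation of the kind reported above: rank systems by mean human score and by mean metric score, then measure how strongly the two rankings agree. The data layout and use of Spearman rank correlation are assumptions for illustration, not the WMT09 evaluation pipeline itself.

```python
# Sketch: Spearman rank correlation between human and metric system rankings.
from scipy.stats import spearmanr


def system_level_correlation(human_scores: dict[str, list[float]],
                             metric_scores: dict[str, list[float]]) -> float:
    """Correlation between system rankings induced by human and metric scores.

    Both dicts map a system name to its per-segment scores; systems are
    compared by their mean score.
    """
    systems = sorted(human_scores)
    human_means = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]
    metric_means = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
    rho, _pvalue = spearmanr(human_means, metric_means)
    return rho
```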

    Automatic Translation Error Analysis
